This course is an introduction to R and GitHub.
Here is the link to my IODS page
link
setwd("/Users/obruck/Desktop/Open data course/IODS-project/data/")
learning2014 <- read.csv("learning2014.csv", header=TRUE)
str(learning2014)
## 'data.frame': 166 obs. of 8 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ Deep : int 43 35 42 42 44 57 46 39 52 48 ...
## $ Stra : int 27 22 29 25 29 29 18 32 34 28 ...
## $ Surf : int 31 38 27 27 34 29 23 34 26 36 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
dim(learning2014)
## [1] 166 8
The “learning2014” dataset consists of 166 observations and 8 variables. The variables (gender, Age, Attitude, Deep, Stra, Surf and Points) describe the results of a survey study conducted as an international collaboration project with teachers.
summary(learning2014)
## X gender Age Attitude Deep
## Min. : 1.00 F:110 Min. :17.00 Min. :14.00 Min. :19.00
## 1st Qu.: 42.25 M: 56 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:40.00
## Median : 83.50 Median :22.00 Median :32.00 Median :44.00
## Mean : 83.50 Mean :25.51 Mean :31.43 Mean :44.16
## 3rd Qu.:124.75 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:49.00
## Max. :166.00 Max. :55.00 Max. :50.00 Max. :59.00
## Stra Surf Points
## Min. :10.00 Min. :19.00 Min. : 7.00
## 1st Qu.:21.00 1st Qu.:29.00 1st Qu.:19.00
## Median :25.50 Median :34.00 Median :23.00
## Mean :24.97 Mean :33.45 Mean :22.72
## 3rd Qu.:29.00 3rd Qu.:38.00 3rd Qu.:27.75
## Max. :40.00 Max. :52.00 Max. :33.00
There are 110 female and 56 male students. The median age is 22 years (range 17-55). The students scored a median of 44 points (range 19-59) on the deep approach (Deep), 25.5 (10-40) on the strategic approach (Stra) and 34 (19-52) on the surface approach (Surf). The median of exam points (Points) was 23 (range 7-33).
pairs(learning2014[-1], col=learning2014$gender)
The data is visualised by gender (males as black dots and females as red dots) in a matrix of bivariate scatter plots.
The data can also be visualised using the “GGally” and “ggplot2” packages.
library("GGally")
library("ggplot2")
p <- ggpairs(learning2014[-1], mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
Now we can see that male students score higher on the attitude questions and female students on the strategic questions. We can also see that the variables Points and Attitude have the highest correlation coefficient (cor: 0.43) and Surf and Deep the second highest in absolute value (cor: 0.32).
The exam points (Points) students achieve are modelled with a multiple linear regression using Attitude, Stra and Surf as explanatory variables. The model describes the linear relationship between the explanatory variables and the dependent variable.
points_regression <- lm(Points ~ Attitude + Stra + Surf, data = learning2014)
summary(points_regression)
##
## Call:
## lm(formula = Points ~ Attitude + Stra + Surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## Attitude 0.33952 0.05741 5.913 1.93e-08 ***
## Stra 0.10664 0.06770 1.575 0.11716
## Surf -0.04884 0.06678 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
In the first model with all three variables, Attitude explains Points with a coefficient estimate of 0.34 (p < 0.0001). The other variables, Stra and Surf, do not explain Points with statistical significance. Thus, Surf (the least significant) is removed first and the model is fitted again.
points_regression <- lm(Points ~ Attitude + Stra, data = learning2014)
summary(points_regression)
##
## Call:
## lm(formula = Points ~ Attitude + Stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.97290 2.39591 3.745 0.00025 ***
## Attitude 0.34658 0.05652 6.132 6.31e-09 ***
## Stra 0.11421 0.06681 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
Attitude remains the only statistically significant variable, so Stra is also removed and the model is run with Attitude alone (coefficient 0.35, p < 0.0001).
points_regression <- lm(Points ~ Attitude, data = learning2014)
summary(points_regression)
##
## Call:
## lm(formula = Points ~ Attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## Attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The adjusted R-squared is 0.19, meaning that in this model Attitude explains about 19% of the total variance of the dependent variable Points.
plot(points_regression, which = c(1,2,5))
The residuals of a linear regression model describe its validity. Residuals should be normally distributed, uncorrelated with each other, and have constant variance; their size should not depend on the fitted values.
The QQ-plot describes the distribution of the residuals. In our model the residuals are approximately normally distributed, as the points fall close to a straight increasing line.
The Residuals vs Fitted plot shows how the residuals relate to the fitted values. According to the plot, the residuals are reasonably evenly distributed on both sides of the zero line, so their spread does not depend on the fitted values.
The Residuals vs Leverage plot helps identify observations with an unusually high impact on the model. According to the plot, all leverage values fall within [0, 0.05], so no observation has an unusual impact on the model.
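As a numerical complement to the visual QQ-plot check, residual normality could also be tested formally, for instance with the Shapiro-Wilk test (a sketch, assuming the fitted model object `points_regression` from above):

```r
# Shapiro-Wilk test of normality on the model residuals;
# a large p-value is consistent with normally distributed residuals
shapiro.test(residuals(points_regression))
```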
setwd("/Users/obruck/Desktop/Open data course/IODS-project/data/")
alc <- read.csv("alc.csv", header = TRUE)
colnames(alc)
## [1] "X" "school" "sex" "age" "address"
## [6] "famsize" "Pstatus" "Medu" "Fedu" "Mjob"
## [11] "Fjob" "reason" "nursery" "internet" "guardian"
## [16] "traveltime" "studytime" "failures" "schoolsup" "famsup"
## [21] "paid" "activities" "higher" "romantic" "famrel"
## [26] "freetime" "goout" "Dalc" "Walc" "health"
## [31] "absences" "G1" "G2" "G3" "alc_use"
## [36] "high_use"
dim(alc)
## [1] 382 36
The data consist of 382 observations and 36 variables, whose names are listed above. The variables describe students' alcohol use together with social, demographic and study-related attributes.
Data Set Information:
P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.
Attribute Information:
1 school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
2 sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
3 age - student’s age (numeric: from 15 to 22)
4 address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
5 famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
6 Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
7 Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
8 Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
9 Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
10 Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
11 reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
12 nursery - attended nursery school (binary: yes or no)
13 internet - Internet access at home (binary: yes or no)
14 guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
15 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
16 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
17 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
18 schoolsup - extra educational support (binary: yes or no)
19 famsup - family educational support (binary: yes or no)
20 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
21 activities - extra-curricular activities (binary: yes or no)
22 higher - wants to take higher education (binary: yes or no)
23 romantic - with a romantic relationship (binary: yes or no)
24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29 health - current health status (numeric: from 1 - very bad to 5 - very good)
30 absences - number of school absences (numeric: from 0 to 93)
31 G1 - first period grade (numeric: from 0 to 20)
32 G2 - second period grade (numeric: from 0 to 20)
33 G3 - final grade (numeric: from 0 to 20, output target)
34 alc_use - average of ‘Dalc’ and ‘Walc’
35 high_use - TRUE if ‘alc_use’ is higher than 2 and FALSE otherwise
Hypothesis: a high number of ‘absences’, lack of ‘famsup’, low ‘studytime’ and bad ‘famrel’ correlate with higher alcohol consumption.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
g1 <- ggplot(alc, aes(x = alc_use, y = absences, fill = sex))
g1 + geom_bar(stat="identity") + ylab("absences") + ggtitle("Student absences by alcohol consumption and sex")
g2 <- ggplot(alc, aes(x = alc_use, y = (famsup == "yes"), fill=sex))
g2 + geom_bar(stat="identity") + ylab("family educational support") + ggtitle("Student family educational support by alcohol consumption")
g3 <- ggplot(data = alc, aes(x = alc_use, y = studytime, fill=sex))
g3 + geom_bar(stat="identity") + ylab("weekly study time") + ggtitle("Student weekly study time by alcohol consumption and sex")
g4 <- ggplot(alc, aes(x = alc_use, y = famrel, fill = sex))
g4 + geom_bar(stat="identity") + ylab("quality of family relationships") + ggtitle("Student quality of family relationships by alcohol consumption and sex")
alc %>% group_by(famsup) %>% summarise(count = n(), mean_alc_use = mean(alc_use))
## # A tibble: 2 × 3
## famsup count mean_alc_use
## <fctr> <int> <dbl>
## 1 no 144 1.965278
## 2 yes 238 1.842437
alc %>% group_by(alc_use) %>% summarise(count = n(), mean_absences = mean(absences))
## # A tibble: 9 × 3
## alc_use count mean_absences
## <dbl> <int> <dbl>
## 1 1.0 140 3.357143
## 2 1.5 69 4.231884
## 3 2.0 59 3.915254
## 4 2.5 44 6.431818
## 5 3.0 32 6.093750
## 6 3.5 17 5.647059
## 7 4.0 9 6.000000
## 8 4.5 3 12.000000
## 9 5.0 9 6.888889
alc %>% group_by(alc_use) %>% summarise(count = n(), mean_studytime = mean(studytime))
## # A tibble: 9 × 3
## alc_use count mean_studytime
## <dbl> <int> <dbl>
## 1 1.0 140 2.307143
## 2 1.5 69 1.971014
## 3 2.0 59 1.983051
## 4 2.5 44 1.931818
## 5 3.0 32 1.718750
## 6 3.5 17 1.470588
## 7 4.0 9 1.777778
## 8 4.5 3 2.000000
## 9 5.0 9 1.666667
alc %>% group_by(famrel) %>% summarise(count = n(), mean_alc_use = mean(alc_use))
## # A tibble: 5 × 3
## famrel count mean_alc_use
## <int> <int> <dbl>
## 1 1 8 2.250000
## 2 2 19 2.184211
## 3 3 64 2.000000
## 4 4 189 1.880952
## 5 5 102 1.750000
Lack of family educational support and bad quality of family relationships seem to correlate with higher alcohol consumption. High alcohol consumption also correlates with more absences from school. In addition, students with high alcohol consumption (alc_use > 3.0) study less than students with low alcohol consumption (alc_use < 2.5). These findings are concordant with the primary hypotheses.
alc$famrel <- factor(alc$famrel)
alc$studytime <- factor(alc$studytime)
m <- glm(high_use ~ absences + famsup + alc$studytime + alc$famrel, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ absences + famsup + alc$studytime +
## alc$famrel, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1803 -0.8226 -0.6503 1.1522 2.2409
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.015455 0.878444 -1.156 0.247693
## absences 0.076934 0.022630 3.400 0.000675 ***
## famsupyes -0.061188 0.245702 -0.249 0.803337
## alc$studytime2 -0.439698 0.267726 -1.642 0.100520
## alc$studytime3 -1.342617 0.442054 -3.037 0.002388 **
## alc$studytime4 -1.279644 0.588173 -2.176 0.029583 *
## alc$famrel2 0.936030 0.964897 0.970 0.332005
## alc$famrel3 0.643055 0.883261 0.728 0.466585
## alc$famrel4 0.272456 0.858654 0.317 0.751012
## alc$famrel5 -0.006842 0.876748 -0.008 0.993773
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 427.24 on 372 degrees of freedom
## AIC: 447.24
##
## Number of Fisher Scoring iterations: 4
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.3622376 0.04877998 1.8120704
## absences 1.0799706 1.03533984 1.1316272
## famsupyes 0.9406468 0.58232256 1.5284716
## alc$studytime2 0.6442312 0.38110876 1.0908572
## alc$studytime3 0.2611613 0.10353967 0.5966021
## alc$studytime4 0.2781364 0.07612213 0.8066302
## alc$famrel2 2.5498396 0.42150543 21.4386194
## alc$famrel3 1.9022832 0.37564722 14.2026417
## alc$famrel4 1.3131852 0.27385709 9.4789868
## alc$famrel5 0.9931812 0.19882276 7.3471751
According to the logistic regression model, more absences correlate with higher odds of high alcohol use (OR 1.08 per absence), while studying more than 5 hours a week (studytime levels 3-4) correlates with lower odds. Family educational support and quality of family relationships do not correlate significantly with alcohol consumption in this model.
probabilities <- predict(m, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
g + geom_point()
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 253 15
## TRUE 94 20
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table()
## prediction
## high_use FALSE TRUE
## FALSE 0.66230366 0.03926702
## TRUE 0.24607330 0.05235602
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.66230366 0.03926702 0.70157068
## TRUE 0.24607330 0.05235602 0.29842932
## Sum 0.90837696 0.09162304 1.00000000
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2853403
According to the results, about 29% of the predictions were wrong. This is clearly better than random guessing (50% wrong), although only slightly better than the trivial strategy of always predicting FALSE, which would misclassify the roughly 30% of students who are high users.
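For comparison, the error rate of a trivial classifier that never predicts high use can be computed with the same loss function (a sketch using the `alc` data and `loss_func` defined above):

```r
# Baseline: always predict FALSE (predicted probability 0 for every student);
# this misclassifies exactly the share of high-use students (about 30 %)
loss_func(class = alc$high_use, prob = 0)
```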
library(boot)
#cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
#cv
#cv$delta[1]
Based on 10-fold cross-validation (the code above, commented out here), the model's out-of-sample performance is somewhat worse: about 31% of predictions are wrong.
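The commented-out cross-validation can be run as below. Note that cv.glm refits the model on subsets of the data, so the formula must use plain column names rather than `alc$` references; `m2` below is a refit of the same model under that assumption:

```r
library(boot)

# Refit the logistic model with plain column names so cv.glm can subset the data
m2 <- glm(high_use ~ absences + famsup + studytime + famrel,
          data = alc, family = "binomial")

# 10-fold cross-validation with the same loss function as above
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m2, K = 10)
cv$delta[1]  # average proportion of wrong predictions across the held-out folds
```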
#setwd("/Users/obruck/Desktop/Open data course/IODS-project/data/")
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
The Boston data frame has 506 rows and 14 columns. The data frame contains the following variables: crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat and medv.
pairs(Boston)
library(dplyr)
cor_matrix<-cor(Boston)
cor_matrix
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## black -0.38506394 0.17552032 -0.35697654 0.048788485 -0.38005064
## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## rm age dis rad tax
## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431
## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332
## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018
## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652
## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320
## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783
## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559
## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158
## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819
## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000
## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304
## black 0.12806864 -0.27353398 0.29151167 -0.444412816 -0.44180801
## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341
## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593
## ptratio black lstat medv
## crim 0.2899456 -0.38506394 0.4556215 -0.3883046
## zn -0.3916785 0.17552032 -0.4129946 0.3604453
## indus 0.3832476 -0.35697654 0.6037997 -0.4837252
## chas -0.1215152 0.04878848 -0.0539293 0.1752602
## nox 0.1889327 -0.38005064 0.5908789 -0.4273208
## rm -0.3555015 0.12806864 -0.6138083 0.6953599
## age 0.2615150 -0.27353398 0.6023385 -0.3769546
## dis -0.2324705 0.29151167 -0.4969958 0.2499287
## rad 0.4647412 -0.44441282 0.4886763 -0.3816262
## tax 0.4608530 -0.44180801 0.5439934 -0.4685359
## ptratio 1.0000000 -0.17738330 0.3740443 -0.5077867
## black -0.1773833 1.00000000 -0.3660869 0.3334608
## lstat 0.3740443 -0.36608690 1.0000000 -0.7376627
## medv -0.5077867 0.33346082 -0.7376627 1.0000000
cor_matrix %>% round(2)
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
#install.packages("corrplot")
library(corrplot)
corrplot(cor_matrix %>% round(2), method="circle")
corrplot(cor_matrix %>% round(2), method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex=0.6)
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
There seems to be a strong negative correlation between age (proportion of owner-occupied units built prior to 1940) and dis (weighted mean of distances to five Boston employment centres), between nox (nitrogen oxides concentration) and dis, between indus (proportion of non-retail business acres per town) and dis, and between lstat (lower status of the population) and medv (median value of owner-occupied homes in $1000s). On the other hand, there are strong positive correlations between indus and nox, indus and tax (full-value property-tax rate per $10,000), nox and age, nox and tax, rm (average number of rooms per dwelling) and medv, and rad (index of accessibility to radial highways) and tax.
Comparing the medians and means, and their position relative to the minimum and maximum values, suggests that the variables crim, zn, age, rad, tax, indus and black are not normally distributed, while the others may be approximately normal. However, to draw such conclusions one should first run a formal test of normality (for instance Kolmogorov-Smirnov or Shapiro-Wilk).
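A quick numerical check of one of the suspect variables could look like this (Shapiro-Wilk shown here as one common alternative to Kolmogorov-Smirnov):

```r
library(MASS)
data("Boston")

# Shapiro-Wilk test of normality for the crime rate;
# a very small p-value indicates the variable is not normally distributed
shapiro.test(Boston$crim)
```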
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
Scaling centres each variable to mean 0 and divides it by its standard deviation. After scaling, all variables thus have a mean of 0 and unit variance, which makes their distributions easier to compare: the closer the median is to 0, the more symmetric the distribution. Note that scaling is a linear transformation and, unlike a logarithmic transformation, does not change the shape of a distribution.
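The standardisation that scale() performs can be verified by hand for a single variable, for example the crime rate:

```r
library(MASS)
data("Boston")

# scale() subtracts the column mean and divides by the standard deviation
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(as.vector(scale(Boston$crim)), manual)  # TRUE
```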
class(boston_scaled)
## [1] "matrix"
boston_scaled <- as.data.frame(boston_scaled)
scaled_crim <- boston_scaled$crim
summary(scaled_crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419400 -0.410600 -0.390300 0.000000 0.007389 9.924000
bins <- quantile(scaled_crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
crime <- cut(scaled_crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
#remove original crim from the dataset
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
library(MASS) # lda() comes from the MASS package
lda.fit <- lda(crime~., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2549505 0.2500000 0.2475248 0.2475248
##
## Group means:
## zn indus chas nox rm
## low 1.0269888 -0.9078168 -0.08120770 -0.8890268 0.47331579
## med_low -0.0516764 -0.3410104 -0.03844192 -0.5843650 -0.08641064
## med_high -0.3834775 0.1913781 0.23949396 0.4729551 0.11679990
## high -0.4872402 1.0171519 -0.07547406 1.0553796 -0.40415238
## age dis rad tax ptratio
## low -0.9244592 0.9037705 -0.6875068 -0.7309863 -0.43284553
## med_low -0.3794765 0.3823690 -0.5429521 -0.4807210 -0.09516574
## med_high 0.4793835 -0.4143945 -0.4512794 -0.3205060 -0.36422792
## high 0.8229498 -0.8650002 1.6377820 1.5138081 0.78037363
## black lstat medv
## low 0.37247437 -0.77191417 0.54335481
## med_low 0.32057834 -0.15480044 0.01936313
## med_high 0.09299831 0.07874694 0.17170553
## high -0.84367321 0.90601891 -0.72585960
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.06811299 0.65232780 -1.05486159
## indus 0.10212556 -0.12182655 0.28608966
## chas -0.10560628 -0.05828899 0.04065742
## nox 0.25112654 -0.83840366 -1.30375408
## rm -0.12200852 -0.10953855 -0.14899366
## age 0.21501996 -0.38720715 -0.11035129
## dis -0.01939596 -0.27927061 0.19856604
## rad 3.63457287 1.00121254 -0.08019350
## tax 0.12343766 -0.08991709 0.53986639
## ptratio 0.08115930 0.02258287 -0.33704736
## black -0.09415592 0.02560342 0.12762993
## lstat 0.26869931 -0.23796919 0.30767800
## medv 0.23176568 -0.38568281 -0.17241846
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9527 0.0369 0.0105
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2)
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 1)
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 10 14 0 0
## med_low 3 15 7 0
## med_high 1 8 14 3
## high 0 0 0 27
According to the results, the model predicts the crime categories reasonably well. Most errors occur in the low and med_low classes, where the class-wise accuracy is lowest, whereas the high class is predicted perfectly.
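The overall test-set accuracy can be summarised from the cross-tabulation (a sketch using the `correct_classes` and `lda.pred` objects above):

```r
# Proportion of correctly classified observations on the test set:
# correct predictions lie on the diagonal of the confusion matrix
conf <- table(correct = correct_classes, predicted = lda.pred$class)
sum(diag(conf)) / sum(conf)
```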
dist_eu <- dist(boston_scaled)
## Warning in dist(boston_scaled): NAs introduced by coercion
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1394 3.5270 4.9080 4.9390 6.2420 13.0000
dist_man <- dist(boston_scaled, method = "manhattan")
## Warning in dist(boston_scaled, method = "manhattan"): NAs introduced by
## coercion
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
plot(1:k_max, twcss, type='b')
km <-kmeans(dist_eu, centers = 2)
pairs(Boston, col = km$cluster)
km <-kmeans(dist_eu, centers = 3)
lda.fit <- lda(km$cluster~., data = boston_scaled)
lda.fit
## Call:
## lda(km$cluster ~ ., data = boston_scaled)
##
## Prior probabilities of groups:
## 1 2 3
## 0.4426877 0.2075099 0.3498024
##
## Group means:
## zn indus chas nox rm age
## 1 -0.2309344 -0.4551440 -0.27232907 -0.4915087 -0.1311176 -0.2883074
## 2 1.3140079 -0.9068835 0.47759479 -0.7455383 1.0548777 -0.7434166
## 3 -0.4872402 1.1139831 0.06132349 1.0642908 -0.4598408 0.8058735
## dis rad tax ptratio black lstat
## 1 0.2864417 -0.5768315 -0.6147205 -0.006886361 0.3154274 -0.2702405
## 2 0.7952165 -0.5935800 -0.6372993 -0.976736455 0.3398644 -0.8793299
## 3 -0.8342410 1.0821251 1.1560103 0.588134874 -0.6007995 0.8636357
## medv crimemed_low crimemed_high crimehigh
## 1 -0.02274038 0.43750000 0.2589286 0.0000000
## 2 1.16160305 0.16190476 0.2761905 0.0000000
## 3 -0.66030777 0.06214689 0.2203390 0.7175141
##
## Coefficients of linear discriminants:
## LD1 LD2
## zn 0.172610013 -1.25618179
## indus 1.078970813 -0.12226466
## chas 0.004463932 -0.49433521
## nox 0.921052125 -0.16586799
## rm -0.040950198 -0.49922411
## age -0.067015888 0.07618846
## dis 0.138379223 -0.06104503
## rad 0.579745765 0.43033468
## tax 0.401343141 -0.66310399
## ptratio 0.374368029 0.11727097
## black -0.061203934 0.05315043
## lstat 0.376322744 -0.56677965
## medv 0.109367758 -0.69236403
## crimemed_low -0.389791316 0.15910228
## crimemed_high -0.486827228 -0.58792893
## crimehigh -0.207144258 -1.08365331
##
## Proportion of trace:
## LD1 LD2
## 0.7951 0.2049
plot(lda.fit, dimen = 2)
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 1)
Using 3 clusters in the K-means analysis, the most influential linear separators (the largest coefficients of LD1, which captures most of the trace) are indus and nox. This means that if the data were divided into 3 classes, indus and nox would explain most of the between-class separation and would thus be the best variables for classifying possible new observations.
model_predictors <- dplyr::select(train, -crime)
###Check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 16 2
###Matrix multiplication
#matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
#matrix_product <- as.data.frame(matrix_product)
#install.packages("plotly")
#library("plotly")
###Create a 3D plot (Cool!) of the columns of the matrix product
#plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=train$crime)
#plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=km$center)
The same results as above are visualised in 3D, using the variable crime in the first plot and the selected clusters in the second to colour the observations. Note that the matrix product is not conformable as written: model_predictors has 13 columns while lda.fit$scaling has 16 rows, because the crime dummy variables were included as predictors in the LDA but are absent from model_predictors. This may also be why the visualisation does not work after knitting the file.
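The commented-out matrix multiplication can be illustrated on a dataset where the dimensions conform; here is a hedged sketch using the built-in iris data instead of the Boston split:

```r
# Sketch of projecting observations into LDA space by matrix multiplication.
library(MASS)

fit <- lda(Species ~ ., data = iris)
predictors <- as.matrix(iris[, 1:4])

# lda() centres the predictors internally, so centre them before
# applying the scaling matrix
centred <- scale(predictors, center = colMeans(predictors), scale = FALSE)
proj <- as.data.frame(centred %*% fit$scaling)

dim(proj)  # one row per observation, one column per linear discriminant
```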
setwd("/Users/obruck/Desktop/Open data course/IODS-project/data/")
human2 <- read.csv("human2.csv")
library(dplyr)
library(tidyr)
library(ggplot2)
library(stringr)
colnames(human2)[1] <- "Country"
human2 <- human2[ ,2:9]
#human2 <- select(human2, -Country)
dim(human2)
## [1] 155 8
str(human2)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : Factor w/ 155 levels "1,123","1,228",..: 132 109 124 112 113 111 103 123 108 93 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
After dropping the country names, there are 155 observations and 8 variables.
The original data is from http://hdr.undp.org/en/content/human-development-index-hdi and has been retrieved, modified and analyzed by Tuomo Nieminen. The data combines several indicators from most countries in the world.
“Country” = Country name
Health and knowledge
“GNI” = Gross National Income per capita
“Life.Exp” = Life expectancy at birth
“Edu.Exp” = Expected years of schooling
“Mat.Mor” = Maternal mortality ratio
“Ado.Birth” = Adolescent birth rate
Empowerment
“Parli.F” = Percentage of female representatives in parliament
“Edu2.F” = Proportion of females with at least secondary education
“Edu2.M” = Proportion of males with at least secondary education
“Labo.F” = Proportion of females in the labour force
“Labo.M” " Proportion of males in the labour force
“Edu2.FM” = Edu2.F / Edu2.M
“Labo.FM” = Labo.F / Labo.M
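The two ratio variables could be formed with a single mutate() call; a hypothetical sketch with made-up values for the component columns:

```r
# Hedged sketch: building the ratio variables from hypothetical raw columns.
library(dplyr)

human_raw <- data.frame(Edu2.F = c(85.0, 60.2), Edu2.M = c(84.4, 71.1),
                        Labo.F = c(56.0, 48.5), Labo.M = c(68.2, 77.0))

human_ratios <- mutate(human_raw,
                       Edu2.FM = Edu2.F / Edu2.M,   # female/male education ratio
                       Labo.FM = Labo.F / Labo.M)   # female/male labour-force ratio
```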
human2 <- mutate(human2, GNI = str_replace(GNI, pattern = ",", replacement = "") %>% as.numeric())
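GNI is read in as a factor because of the thousands separators; the same conversion can also be done in base R with gsub():

```r
# Base-R equivalent of the stringr conversion above.
gni_raw <- c("1,123", "12,040", "123,124")
gni_num <- as.numeric(gsub(",", "", gni_raw, fixed = TRUE))
gni_num  # 1123 12040 123124
```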
pairs(human2)
summary(human2)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
library(GGally)
library(corrplot)
#ggpairs(data=human2)
cor(human2) %>% corrplot()
Variables “Edu2.FM”, “Labo.FM”, “Life.Exp”, “Edu.Exp” and “Parli.F” appear symmetrically distributed, as their means and medians are close to each other. On the other hand, “GNI”, “Mat.Mor” and “Ado.Birth” have clearly differing median and mean values, indicating skewed distributions.
The pairs “Life.Exp”–“Edu.Exp” and “Mat.Mor”–“Ado.Birth” show positive correlations. The pairs “Life.Exp”–“Mat.Mor”, “Edu.Exp”–“Mat.Mor”, “Life.Exp”–“Ado.Birth” and “Edu.Exp”–“Ado.Birth” correlate negatively.
pca_human <- prcomp(human2)
biplot(pca_human, choices = 1:2, cex=c(0.8,1), col=c("grey40", "deeppink2"))
human_scaled <- scale(human2)
pca_human <- prcomp(human_scaled)
s <- summary(pca_human)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
#Rounded percentages of variance captured by each PC
pca_pr <- round(s$importance[2, ], digits = 5)
#Print out the percentages of variance
round(pca_pr*100, digits=1)
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
pca_pr <- round(pca_pr*100, digits=1)
#Create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
paste0(names(pca_pr), " (", pca_pr, "%)")
## [1] "PC1 (53.6%)" "PC2 (16.2%)" "PC3 (9.6%)" "PC4 (7.6%)" "PC5 (5.5%)"
## [6] "PC6 (3.6%)" "PC7 (2.6%)" "PC8 (1.3%)"
biplot(pca_human, choices = 1:2, cex=c(0.8,1), col=c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
The results differ. The PCA plot of the unscaled data suggests that a single variable explains the variance of both principal components, loading entirely on PC1. After scaling, all variables are displayed clearly in the plot, each pointing in a different direction and with roughly equal amplitude. Scaling reduces the differences between the value ranges of the variables and thus makes them comparable.
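The effect of scaling can be demonstrated with a small synthetic example: unscaled, a large-variance variable dominates PC1, whereas after scaling the two variables contribute roughly equally (a sketch with made-up data):

```r
# Sketch: why scaling matters for PCA.
set.seed(1)
small <- rnorm(50)              # unit-variance variable
big   <- rnorm(50, sd = 1000)   # large-variance variable (like unscaled GNI)
d <- data.frame(small, big)

unscaled <- prcomp(d)
scaled   <- prcomp(d, scale. = TRUE)

summary(unscaled)$importance[2, 1]  # PC1 captures nearly all the variance
summary(scaled)$importance[2, 1]    # closer to an even split after scaling
```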
Adolescent birth rate and maternal mortality point towards increasing PC1, while life expectancy, expected years of schooling and the female-to-male ratio of secondary education point in the opposite direction. This is understandable, as the opposing arrows describe phenomena of developing and developed countries, respectively. Perpendicular to these variables are the percentage of female representatives in parliament and the female-to-male labour-force ratio, which point towards increasing PC2.
PC1 describes a dimension of health and knowledge. The variables aligned with PC1 increase either wellbeing or education, which correlate strongly with each other, as shown earlier in the correlation matrix. PC2 describes female empowerment and is independent of the themes of PC1. The correlation plot supports this notion, as the variables aligned with PC2 correlate positively with each other but negatively with the variables aligned with PC1.
library(FactoMineR)
data(tea)
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
tea_time <- data.frame(tea$Tea, tea$How, tea$how, tea$sugar, tea$where, tea$lunch)
summary(tea_time)
## tea.Tea tea.How tea.how tea.sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## tea.where tea.lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ tea.Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ tea.How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ tea.how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ tea.sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ tea.where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ tea.lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables; they will
## be dropped
###Run MCA
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.279 0.261 0.219 0.189 0.177 0.156
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.144 0.141 0.117 0.087 0.062
## % of var. 7.841 7.705 6.392 4.724 3.385
## Cumulative % of var. 77.794 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.003 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 0.027 2.867 | 0.433 9.160 0.338 10.053 |
## green 0.107 -5.669 | -0.108 0.098 0.001 -0.659 |
## alone 0.127 -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 0.035 3.226 | 1.329 14.771 0.218 8.081 |
## milk 0.020 2.422 | 0.013 0.003 0.000 0.116 |
## other 0.102 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag 0.161 -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 0.478 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged 0.141 -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## tea.Tea | 0.126 0.108 0.410 |
## tea.How | 0.076 0.190 0.394 |
## tea.how | 0.708 0.522 0.010 |
## tea.sugar | 0.065 0.001 0.336 |
## tea.where | 0.702 0.681 0.055 |
## tea.lunch | 0.000 0.064 0.111 |
plot(mca, invisible=c("ind"), habillage="quali")
The variable categories are displayed on a biplot of the first two MCA dimensions, which explain 15.2% and 14.2% of the total variance, respectively.
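The percentages quoted above can be read directly from the eigenvalue table of the MCA object; a sketch assuming FactoMineR and the same six tea variables:

```r
# Sketch: extracting the dimension percentages from the MCA eigenvalues.
library(FactoMineR)

data(tea)
tea_time <- tea[, c("Tea", "How", "how", "sugar", "where", "lunch")]
mca <- MCA(tea_time, graph = FALSE)

round(mca$eig[1:2, "percentage of variance"], 1)  # Dim.1 and Dim.2 shares
```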